Length-independent vector-space document similarity measures based on regression residuals

Author

  • Derrick Higgins
Abstract

One aspect of vector-space semantic similarity estimates that has so far received little attention is their dependence on the length of the texts being compared. A simple experiment demonstrates this effect. The data set used for this demonstration consists of texts from the Lexile collection, a set of general fiction titles spanning a wide range of grade-school reading levels, which ETS licenses from the Metametrics Corporation. 400 documents were selected randomly from this collection and truncated so that 100 included only the first 500 word tokens, 100 included only the first 1000 words, 100 included only the first 5000 words, and the final 100 included only the first 10,000 words. Two sets of similarity scores were calculated for each pair of documents in this collection (excluding duplicates). The first set of similarity scores was created using a simple content vector analysis (CVA) model with tf*idf weighting, with log term weights and an inverse document frequency term equal to $\log\left(\frac{\mathit{NDocs}}{\mathit{DocFreq}_k}\right)$ for each term $k$, where the document frequency estimates were derived from the TASA corpus of high-school level texts on a variety of academic subjects. The second set of similarity scores was created using a Random Indexing (RI) (Sahlgren, 2006) model with parameters similar to those used for the CVA model. The RI model used co-occurrence of words within the same document in the TASA corpus as the basis for dimensionality reduction, and also used the TASA corpus to estimate inverse document frequency values for individual terms. Document vectors were produced as the tf*idf-weighted sum of the term vectors occurring within the document, again with log weighting of both term frequencies and inverse document frequencies. For each of these methods, cosine was used as the similarity metric.
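
The CVA scoring described above can be illustrated with a minimal Python sketch. It is not the author's implementation: the exact form of the log term weight is not given in the abstract, so `1 + log(tf)` is an assumption, as are the function names, the decision to skip terms missing from the reference corpus, and the toy document frequencies that stand in for the TASA estimates.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Build a sparse tf*idf vector as a dict mapping term -> weight.

    Term weight: 1 + log(tf)  (assumed form of the "log term weight").
    IDF weight:  log(NDocs / DocFreq_k), as stated in the abstract.
    """
    vec = {}
    for term, tf in Counter(tokens).items():
        df = doc_freq.get(term)
        if not df:
            # Assumption: terms unseen in the reference corpus are ignored.
            continue
        vec[term] = (1.0 + math.log(tf)) * math.log(n_docs / df)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy reference statistics (hypothetical, standing in for TASA).
n_docs = 100
doc_freq = {"the": 90, "cat": 10, "dog": 12, "sat": 20, "ran": 15}

vec_a = tfidf_vector("the cat sat the cat".split(), doc_freq, n_docs)
vec_b = tfidf_vector("the dog ran".split(), doc_freq, n_docs)
sim_ab = cosine(vec_a, vec_b)
print(f"cosine(a, b) = {sim_ab:.4f}")
```

The two toy documents share only the high-frequency word "the", whose IDF is close to zero, so their similarity is small but positive.
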
It turns out that the similarity scores calculated by these methods for the Lexile data set are highly correlated with the variable gTypes, the geometric mean of the number of word types in the two documents to be compared. The correlation between CVA similarity and gTypes is 0.86, while the correlation between RI similarity and log(gTypes) is 0.89. At least in part, this dependency of similarity on length is due to statistical properties of vector-space semantic techniques themselves, rather than to particular differences in text composition between short and long documents. As the documents increase in length, the law of large numbers indicates that their semantic vectors will converge to the mean vector of the distribution, and therefore that the similarity between vectors will converge to $\mathrm{sim}(\vec{d}_{mean}, \vec{d}_{mean}) = 1$. Even if the documents are not drawn from the same distribution, or on similar topics, we can assume that document topics are not so sharply delimited in terms of their vocabulary that the vectors for two different documents will tend to converge to orthogonal vectors as they increase in length.
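
The law-of-large-numbers argument can be checked with a small simulation. The sketch below is illustrative only and does not reproduce the Lexile experiment. It assumes that all documents are sampled from a single Zipf-like unigram distribution over a hypothetical 2000-type vocabulary and, for simplicity, uses raw term counts rather than log-weighted tf*idf. It then correlates pairwise cosine similarity with gTypes and with log(gTypes).

```python
import math
import random
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def simulate(seed=0, vocab_size=2000, lengths=(50, 200, 1000, 5000),
             docs_per_length=5):
    """Sample documents of several lengths from one shared distribution and
    return (gTypes, cosine) for every pair of documents."""
    rng = random.Random(seed)
    vocab = list(range(vocab_size))
    # Zipf-like weights: probability proportional to 1 / rank (assumption).
    weights = [1.0 / (rank + 1) for rank in vocab]
    docs = [Counter(rng.choices(vocab, weights=weights, k=n))
            for n in lengths for _ in range(docs_per_length)]
    g_types, sims = [], []
    for a, b in combinations(docs, 2):
        # len(Counter) is the number of distinct word types in the document.
        g_types.append(math.sqrt(len(a) * len(b)))
        sims.append(cosine(a, b))
    return g_types, sims


g_types, sims = simulate()
r_linear = pearson(g_types, sims)
r_log = pearson([math.log(g) for g in g_types], sims)
print(f"corr(sim, gTypes)      = {r_linear:.2f}")
print(f"corr(sim, log gTypes)  = {r_log:.2f}")
```

Because every simulated document comes from the same distribution, any positive correlation here arises from length alone, which is the statistical effect the abstract describes.
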

Similar resources

A new vector valued similarity measure for intuitionistic fuzzy sets based on OWA operators

Much research has been carried out on measures of distance, similarity, and correlation between intuitionistic fuzzy sets (IFSs). However, most of these are single-valued measures and lack the potential for efficiency validation. In this paper, a new vector valued similarity measure for IFSs is proposed based on OWA operators. The vector is defined as a two-tuple consisting of ...

A novel method for detecting structural damage based on data-driven and similarity-based techniques under environmental and operational changes

The application of time series modeling and statistical similarity methods to structural health monitoring (SHM) provides promising and capable approaches to structural damage detection. The main aim of this article is to propose an efficient univariate similarity method, named Kullback similarity (KS), for identifying the location of damage and estimating the level of damage severity. An impr...

Pii: S0306-4573(00)00027-3

In this paper two distinct similarity measures in a document vector space, the distance-based and angle-based similarity measures, are compared, and a newly developed similarity measure based upon both the distance and angle strengths of two compared objects is presented. The concept of the iso-extent contour, which facilitates the understanding of the nature of the newly developed similarity me...

Performance Analysis of Layered Vector Space Model in Web Information Retrieval

Information on the web is growing exponentially. The unprecedented growth of available information, coupled with the vast number of available online activities, has introduced a new wrinkle to the problem of web search: it is difficult to retrieve relevant information. In this context search engines have become a valuable tool for users to retrieve relevant information. Finding relevant infor...

Similarity Measures in Documents Using Association Graphs

In this paper we present a new model, designated as Association Graph, to improve document representation, facilitating the ontological dimension. We explain how to generate and use this kind of graph. Also, we analyze different document similarity measures based on this representation. A classical vector space model was used to evaluate this model and measures, investigating their strengths an...

Publication date: 2008